Learning from the Brain with Applications to Scene Understanding
and to Speech Synthesis and Understanding


Executive  Summary 

The  problem  of  learning  represents  a  gateway  to  understanding  intelligence  in  brains  and 
machines  and  to  making  intelligent  machines  that  learn  from  experience.  Since  our  brain 
represents the best known example of a learning machine, we have developed algorithms
directly  based  on  the  architecture  of  the  cortex.  In  the  process  we  have  demonstrated  systems 
in  applications  of  interest  to  IPTO.  In  particular,  we  have  demonstrated 

o  a  system  for  scene  understanding  in  street  environments 

o  videorealistic  synthesis  of  a  speaking  agent. 

In the more recent phases of the project, we have explored architectures for learning and for
visual recognition that have a deeply hierarchical organization and incorporate
attentional feedback and top-down priors. We have also extended our demonstration to video
and  the  recognition  of  actions.  In  particular,  we  have  developed 

o A system, derived from our model of visual cortex, for the automatic, quantitative
phenotyping  of  mouse  behavior  from  videos. 

o A mathematical framework - a theory of hierarchical kernel machines - to characterize
the  properties  and  limitations  of  deep  learning  networks  with  a  hierarchical  architecture 
similar  to  the  cortex. 




A  Brief  History  of  the  Projects 


We started the project with the plan of (a) extending supervised learning algorithms toward the
fundamental ability to learn from just a few examples, and (b) applying existing learning
techniques and their extensions to an important application domain for the DoD, e.g., scene
understanding tasks in street images, using a database that we created of images of streets
in Cambridge, with a variety of objects, from buildings to stores, people, traffic lights, cars,
trucks, and buses. The system we proposed had the goal of describing a street scene by identifying
the key objects in it and, ultimately, of understanding video.

In  addition,  we  proposed  to  develop  learning  techniques  for  multimodal  human-computer 
interfaces  (computer  graphics  and  speech  synthesis)  and  in  particular  videorealistic  synthesis  of 
a  speaking  agent.  From  short  video  segments,  we  had  developed  a  technique  for  learning  how 
a  person  speaks,  and  then  generated  a  synthetic  video  of  the  person’s  face  speaking  an 
arbitrary  segment  of  somebody  else’s  speech,  even  in  another  language.  We  had  planned  to 
extend  the  system  to  synthesize  3D  videos  and  to  synthesize  the  voice  of  a  person  by  learning 
from  a  very  small  speech  corpus. 

We  have  worked  on  extending  supervised  learning  algorithms  for  the  last  four  years  with  a 
number of achievements on the theoretical side. The most recent one concerns the development
of  a  mathematical  framework  for  hierarchical  learning  systems,  inspired  by  the  architecture  of 
the  visual  cortex. 

We  achieved  most  of  our  goals  on  the  multimodal  human-computer  interfaces  project  at  the  end 
of  the  first  two  years  (see  Appendices). 

Our  main  project  -  of  scene  and  video  understanding  using  a  small  number  of  training  images  - 
reached  most  of  the  planned  results  after  the  first  three  years.  Afterwards,  it  successfully 
explored  new  architectures  for  learning  and  recognition  directly  derived  from  cortical 
architectures. 
Background and Goals

Developing  systems  that  can  truly  learn  from  experience,  mostly  by  themselves,  in  an 
incremental  way,  would  ultimately  be  relevant  for  many  DARPA  projects.  Our  applications  -- 
scene  understanding  and  monitoring  in  street  environments  and  videorealistic  synthesis  of  a 
speaking  agent  --  should  have  a  direct  impact  on  the  technology  of  surveillance  in  general  and 
on  electronic  disinformation  techniques. 

In particular, we worked on scene understanding and monitoring tasks in street environments.
We  collected  a  database  of  images  of  streets  in  Cambridge,  with  a  variety  of  objects,  from 
buildings  to  stores,  people,  traffic  lights,  cars,  trucks,  buses.  The  system  we  proposed  should 
describe  a  street  scene  by  identifying  the  key  objects  in  it.  Ultimately,  such  a  system  may 
understand video and report anomalous events. The system must be able to learn from
just a few example images of each object.

We  also  planned  to  work  on  multimodal  human-computer  interfaces  (computer  graphics  and 
speech  synthesis)  and  in  particular  videorealistic  synthesis  of  a  speaking  agent.  From  short 
video  segments,  we  had  already  developed  a  technique  for  learning  how  a  person  speaks,  and 
then generated a synthetic video of the person's face speaking an arbitrary segment of
somebody else's speech, even in another language. We planned to extend the system to
synthesize  3D  videos  and  especially  the  voice  of  a  person  by  learning  from  a  small  speech 
corpus. 


Main  Accomplishments  Over  the  Course  of  the  Project 

• Following a model of the ventral stream of visual cortex, we developed a novel set of
features for visual recognition. These features outperformed state-of-the-art features on
several datasets used in computer vision benchmarks.

•  A  system  for  object  recognition  in  street  scenes  was  built  on  top  of  these  features.  The 
system  can  reliably  identify  several  different  object  types  such  as  cars,  bikes,  pedestrians, 
sky, road, buildings and trees. While the performance is not fully satisfactory, the system
outperforms state-of-the-art systems that we implemented for comparison.

•  A  novel  learning  algorithm  was  developed  for  learning  from  few  examples.  The  algorithm  is 
a  variant  of  gentleBoost  and  was  especially  designed  to  avoid  overfitting  on  small  datasets. 
The  algorithm  selects  relevant  features  and  provides  a  considerable  speed-up  for  our  object 
recognition  system.  It  was  shown  to  outperform  existing  boosting  algorithms  not  only  on  a 
variety  of  vision  datasets  but  also  on  various  genomic  datasets,  where  learning  from  few 
examples  is  also  a  concern. 

• We completed the development of a hierarchical feedforward architecture for object
recognition based on the anatomy and the physiology of the visual cortex, and showed that
the resulting performance on several databases of complex images is as good as or better
than the best available computer vision systems (2007 PAMI paper).

• We have developed the notion of "audio flow", which is inspired by the notion of "optical
flow" from computer vision, and which models the shifting that occurs in the formants
during speech. Audio flow defines the correspondence from one vocal tract filter to another.
• We have successfully created a morphable model of the vocal tract filter space, in which any
filter is viewed as a "morphed" combination of prototype filters extracted from a small
20-second corpus. In the morphable model we define audio flow between 60-80 prototype
filters. Any novel vocal tract filter is modeled as a morph between those 60-80 prototype
filters.

• We have also successfully created a vector space for the excitation signal, in which any
pitch period in the excitation signal is viewed as a linear combination of prototype pitch
periods extracted from a small 10-second corpus.

•  We  developed  the  feedforward  path  of  a  new  architecture  for  object  recognition  based  on 
the  anatomy  and  the  physiology  of  the  visual  cortex. 

•  We  showed  that  the  resulting  performance  on  complex  imagery  outperforms  state-of-the-art 
vision  systems. 

•  We  also  showed  for  the  first  time  that  a  neurobiological  model  of  cortex  does  as  well  as 
humans  and  better  than  state-of-the-art  computer  vision  systems  on  a  challenging,  natural 
image recognition task (2007 PNAS paper).

•  We  have  collected  a  database  of  images  of  streets  in  Cambridge,  with  a  variety  of  objects, 
from  buildings  to  stores,  people,  traffic  lights,  cars,  trucks,  buses  and  completed  a  first 
version  of  a  system  capable  of  scene  understanding  in  such  a  domain  (street  images).  See 
http://cbcl.mit.edu/software-datasets/streetscenes/ 

•  We  have  obtained  preliminary  results  of  speech  synthesis  from  very  short  training 
sequences.  Separately,  we  improved  a  system  for  learning  how  a  person  speaks,  and  for 
then generating a synthetic video of the person's face speaking an arbitrary segment of
somebody else's speech, even in another language.

•  We  investigated  the  use  of  morphable  models  for  audio  synthesis,  which  enabled  us  to 
synthesize  a  voice  from  a  very  small  audio  corpus  (<  1  minute).  A  crucial  component  in 
achieving  this  goal  was  to  develop  a  representation  of  speech  that  is  smooth  and  which  can 
accurately  reconstruct  speech.  If  the  representation  is  smooth,  it  may  then  be  placed  in  a 
morphable  model  framework,  and  we  can  then  morph  segments  of  speech  to  produce  new 
realistic  utterances  from  very  small  amounts  of  data. 

•  We  have  made  significant  progress  in  this  regard;  in  particular,  we  have  developed  a  novel 
representation  called  Max-Gabor  analysis,  which  produces  a  smooth  representation  of 
speech.  This  analysis  is  inspired  mainly  by  the  work  of  Shamma  and  colleagues,  who  have 
developed  a  two-stage  auditory  model  based  on  psycho-acoustical  and  neurophysiological 
findings in the early and central stages of the auditory pathway. This representation also
borrows  from  the  work  of  Riesenhuber  and  Poggio,  who  developed  a  model  of  object 
recognition  in  visual  cortex  by  embedding  a  MAX  operator  in  a  hierarchical  neural  model. 

• Max-Gabor analysis works by analyzing small spectro-temporal patches P of a
two-dimensional magnitude spectrogram S(f,t), and representing each patch by its locally
dominant spectro-temporal periods T(f,t) and orientations Theta(f,t). The method also
estimates local patch amplitudes A(f,t) and phases Phi(f,t). Since the local patches
are Gabor-like, Max-Gabor analysis operates by performing a two-dimensional Gabor-like
analysis of the spectrogram, retaining only the parameters of the 2D Gabor filter with
maximal amplitude response within the local region. Hence we call our technique a
Max-Gabor analysis of spectrograms. Given the estimated local periods T(i,j), orientations
Theta(i,j), amplitudes A(i,j), and phases Phi(i,j), the spectrogram S(f,t) can be reconstructed
by synthesizing individual local 2D Gabors Gij(f,t) for each patch, and overlap-adding them
together.

• We analyzed and re-synthesized several test utterances of different speakers uttering the
phrase "Hi Jane". Our results may be seen on the web at
http://cuneus.ai.mit.edu:8000/research/maxgabor. In general, we have found that the
Max-Gabor parameters are smooth, meaningful, and capable of reconstructing the original
spectrogram. Our future goal is to see if it is possible to morph the Max-Gabor parameters
to generate novel speech.

•  We  have  extended  the  model  of  the  ventral  stream  to  incorporate  neuroscience  data  on 
backprojections  and  control  of  attention  and  eye  movements  in  collaboration  with  Bob 
Desimone  (McGovern  Institute  and  BCS).  Preliminary  results  show  that  this  extended  model 
can  predict  human  eye  movements  in  top-down  tasks  better  than  other  standard  models  of 
saliency. 

• We are developing - with neuroscience details - an extension of the model to the dorsal
stream for the recognition of actions.

• We have used the system above to phenotype mouse behavior, developing a vision system
that could become a useful tool for biologists. We have a prototype system that we
will test in several labs at MIT and the Broad Institute working with mutant mice as models of
mental and neurological diseases.
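
The Max-Gabor analysis described in the accomplishments above can be illustrated on a single
spectrogram patch. The following is a minimal Python sketch, not the project's implementation:
the Gaussian window, the candidate grids of periods and orientations, and the function name
max_gabor_patch are assumptions made for illustration. For each candidate 2D Gabor it computes
a complex response over the patch and keeps only the parameters of the candidate with the
maximal amplitude.

```python
import cmath
import math


def max_gabor_patch(patch, periods, orientations):
    """Return (period, orientation, amplitude, phase) of the strongest Gabor.

    patch        : 2D list indexed as patch[f][t] (a small spectrogram region)
    periods      : candidate spectro-temporal periods, in samples (assumed grid)
    orientations : candidate orientations in radians, within [0, pi) (assumed grid)
    """
    n_f, n_t = len(patch), len(patch[0])
    cf, ct = (n_f - 1) / 2.0, (n_t - 1) / 2.0      # patch centre
    sigma = min(n_f, n_t) / 4.0                     # assumed Gaussian window width
    window = [[math.exp(-((f - cf) ** 2 + (t - ct) ** 2) / (2.0 * sigma ** 2))
               for t in range(n_t)] for f in range(n_f)]
    norm = sum(sum(row) for row in window)

    best = None
    for period in periods:
        k = 2.0 * math.pi / period
        for theta in orientations:
            kf, kt = k * math.cos(theta), k * math.sin(theta)
            # Complex response of the windowed patch to this Gabor carrier.
            resp = sum(window[f][t] * patch[f][t]
                       * cmath.exp(-1j * (kf * (f - cf) + kt * (t - ct)))
                       for f in range(n_f) for t in range(n_t))
            amp = 2.0 * abs(resp) / norm            # amplitude estimate
            if best is None or amp > best[2]:
                best = (period, theta, amp, cmath.phase(resp))
    return best
```

Applied to every patch of a spectrogram, a routine of this kind would yield the maps T, Theta,
A and Phi; the reconstruction step (overlap-adding local Gabors) is not shown here.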
• Evolutionary Portrait Art: Tomaso Poggio by Gunter Bachelier: Portraits of the masterminds are
transformed...

•  Technology  Review  --  By  Anne-Marie  Corley,  A  Robot  that  Navigates  Like  a  Person:  A  new  robot 
navigates using humanlike visual processing and object detection. Tuesday, June 30, 2009

•  Biomedical  Computation  Review  -  by  Roberta  Freidman,  PhD,  “Reverse  Engineering  the  Brain”. 
Volume  5,  Issue  2,  pages  10-17,  ISSN  1557-3192,  Spring  2009 

• LaStampa.it Tecnologia News, Tomaso Poggio, uno dei padri della neuroscienza e professore al
MIT di Boston: "I robot non saranno una minaccia per almeno altri 10 anni. Oggi è Google un
potenziale pericolo" (translation: Tomaso Poggio, one of the fathers of neuroscience and
professor at MIT in Boston: "Robots will not be a threat for at least another 10 years. Today
Google is a potential danger")
• Masterminds of Artificial Intelligence - Evolutionary Portrait Art by Gunter Bachelier, January 2009

•  PC  Magazine:  Future  Watch:  Understanding  the  Brain.  August  2008 

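• TERRA ACTUALIDAD - INTERNACIONAL - Tecnalia desarrolla en Massachussets la tesis doctoral
de una investigadora vasca sobre biología informática y neurología (translation: Tecnalia
develops in Massachusetts the doctoral thesis of a Basque researcher on computational biology
and neurology)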

• MIT NEWS - by David Chandler, Learning about brains from computers, and vice versa - Tomaso
Poggio.
Appendices: details on some of the accomplishments


1.  Model  of  the  ventral  stream  of  visual  cortex  for  object  recognition 

We  have  been  using  the  quantitative  model  of  visual  cortex  for  object  recognition  tasks.  The 
model  achieves  outstanding  performance  on  a  detection  task  involving  a  variety  of  object 
categories  and  simultaneously  enables  learning  from  only  a  few  training  examples.  The 
resulting  system  outperforms  state-of-the-art  systems  over  a  variety  of  object  image  data  sets 
from  different  groups.  The  model  detects  both  the  large  amorphous  objects  (trees,  sky, 
buildings  and  road),  and  the  rigid  objects  (cars,  pedestrians,  bikes).  On  both  of  these  tasks  the 
system  outperforms  other  systems  on  a  very  large  and  challenging  data  set  we  collected  as  a 
benchmark. Moreover, the features used by the model are excellent features for learning from
only a few training examples. Table 1 summarizes the comparisons we performed between the
model  and  other  state-of-the-art  computer  vision  systems  on  datasets  from  several  groups 
(including  our  own  StreetScene  Database). 

The model fits neuroscience data well.


This new set of features is indeed qualitatively and quantitatively consistent with several
properties of cells in V1, V2, V4, IT, and PFC, as well as with several fMRI and psychophysical data.
For instance, the model predicts, at the C1 and C2 levels respectively, the max-like behavior of
a subclass of complex cells in V1 and V4. It also agrees with other data in V4 (Reynolds et al.,
1999) about the response of neurons to combinations of simple two-bar stimuli (within the
receptive field of the S2 units), and some of the C2 units in the model show a tuning for
boundary conformations (Pasupathy & Connor, 2001) that is consistent with recordings from
V4 (Serre et al., 2005). Read-out from C2b units in the model predicted (Serre et al., 2005)
recent read-out experiments in IT (Hung et al., 2005), showing very similar selectivity and
invariance for the same set of stimuli.

The  model  mimics  human  performance  on  a  challenging  detection  task. 


The new set of features, when used to classify between animal and non-animal images,
performed at the level of human observers (with rapid presentations). The model was shown to
predict the pattern of performance of human observers on different animal subcategories (see
Fig. 1). Additionally, we found that the model and human observers tend to produce similar
responses (both correct and incorrect). The overall image-by-image correlation between the
model and human observers is high (specifically 0.71, 0.84, 0.71 and 0.60 for heads,
close-body, medium-body and far-body respectively, with p < 0.01). Finally, we found that,
surprisingly, the model and human observers exhibit a similar robustness to image orientation
(90° rotation and inversion).

The model's implementation has been improved significantly.

In the last year the model was improved in several ways. The improvements include a more
effective representation at the higher level of the model, the addition of image descriptors
that capture important gestalt properties, and a significant speedup (see slide #). As a result,
the model improved significantly in accuracy while achieving a considerable drop in run time.
Table 1: Summary of the comparisons performed between the model and other computer vision
systems. For all comparisons, all systems were trained and tested on the same sets.
Figure 1: Comparison between the model and human observers on a rapid animal vs. non-animal
categorization task. (Left) The stimulus dataset, i.e., four animal subcategories with matching distractors.
(Right) Performance of the model and human observers. The measure reported is d', a
sensitivity measure that combines the hit and false-alarm rates of each observer into one
standardized score.
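
As a worked illustration of the d' measure reported in Figure 1: under the standard
equal-variance signal detection convention, d' is the difference between the z-transformed hit
rate and the z-transformed false-alarm rate. The sketch below assumes that convention (the
report does not spell out the formula), and the function name d_prime is hypothetical.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1; the inverse normal CDF
    is undefined at exactly 0 or 1.
    """
    z = NormalDist().inv_cdf   # inverse CDF of the standard normal
    return z(hit_rate) - z(false_alarm_rate)
```

An observer whose hit rate equals its false-alarm rate has d' = 0 (no sensitivity), and d'
grows as hits increase or false alarms decrease.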


2.  Comparison  with  physiological  observations 

The  quantitative  implementation  of  the  model  allows  for  direct  comparisons  between  the 
responses  of  units  in  the  model  and  electrophysiological  recordings  from  neurons  in  the  visual 
cortex.  Here  we  illustrate  this  approach  by  directly  comparing  the  model  against  recordings  from 
the  macaque  monkey  area  V4  and  inferior  temporal  cortex  while  the  animal  was  passively 
viewing  complex  images. 

The  model  includes  several  layers  that  are  meant  to  mimic  visual  areas  V1,  V2,  V4  and  IT 
cortex.  We  directly  compared  the  responses  of  the  model  units  against  electrophysiological 
recordings  obtained  throughout  all  these  visual  areas.  The  model  is  able  to  account  for  many 
physiological  observations  in  early  visual  areas.  For  instance,  at  the  level  of  V1,  model  units 
agree  with  the  tuning  properties  of  cortical  cells  in  terms  of  both  frequency  and  orientation 
bandwidth,  as  well  as  peak  frequency  selectivity  and  receptive  field  sizes  (see  (Serre  and 
Riesenhuber,  2004)).  Also  in  V1,  we  observe  that  model  units  in  the  C1  layer  can  explain 
responses  of  a  subpopulation  of  complex  cells  obtained  upon  presenting  two  oriented  bars 
within the receptive field (Lampl et al., 2004). At the level of V4, model C2 units exhibit tuning for
complex  gratings  (based  on  the  recordings  from  (Gallant  et  al.,  1996)),  and  curvature  (based  on 
(Pasupathy  and  Connor,  2001)),  as  well  as  interactions  of  multiple  dots  (based  on  (Freiwald  et 
al.,  2005))  or  the  simultaneous  presentation  of  two-bar  stimuli  (based  on  (Reynolds  et  al.,  1999), 
see  (Serre  et  al.,  2005)  for  details). 

Here we focus on one comparison between C2 units and the responses of V4 cells. Figure 2
shows the side-by-side comparison between a model C2 unit and V4 cell responses to the
presentation of one-bar and two-bar stimuli. As in (Reynolds et al., 1999), model units were
presented with either 1) a reference stimulus alone (an oriented bar at position 1, see Figure
2A), 2) a probe stimulus alone (an oriented bar at position 2) or 3) both a reference and a probe
stimulus simultaneously. We used stimuli of 16 different orientations for a total of 289 = (16 + 1)^2
stimulus combinations for each unit (see (Serre et al., 2005) for details). Each unit's
response  was  normalized  by  the  maximal  response  of  the  unit  across  all  conditions.  As  in 
(Reynolds et al., 1999) we computed a selectivity index as the normalized response of the unit
to one of the probe stimuli minus the normalized response of the unit to the reference stimulus.
This index was computed for each of the probe stimuli, yielding 16 selectivity values for each
model unit. This selectivity index ranges from -1 to +1, with negative values indicating that the
reference stimulus elicited the stronger response, a value of 0 indicating identical responses to
reference and probe, and positive values indicating that the probe stimulus elicited the stronger
response. We also computed a sensory interaction index which corresponds to the normalized
response to a pair of stimuli (the reference and a probe) minus the normalized response to the
reference alone. The sensory interaction index also takes on values from -1 to +1. Negative values
indicate  that  the  response  to  the  pair  is  smaller  than  the  response  to  the  reference  stimulus 
alone  (i.e.,  adding  the  probe  stimulus  suppresses  the  neuronal  response).  A  value  of  0  indicates 
that  adding  the  probe  stimulus  has  no  effect  on  the  neuron’s  response  while  positive  values 
indicate  that  adding  the  probe  increases  the  neuron’s  response. 
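
The two indices can be written out concretely. The following is a minimal sketch, assuming
non-negative responses and using the sign convention implied by the interpretation above
(the selectivity index is negative when the reference elicits the stronger response, i.e.,
probe minus reference). The function name and the example values in the comments are
hypothetical, not taken from the recordings.

```python
def interaction_indices(ref, probes, pairs):
    """Compute selectivity and sensory interaction indices for one unit.

    ref    : response to the reference stimulus alone
    probes : responses to each probe stimulus alone
    pairs  : responses to reference + probe, in the same order as probes

    Responses are first normalized by the unit's maximal response across
    all conditions, so both indices fall between -1 and +1.
    """
    peak = max([ref] + list(probes) + list(pairs))
    if peak <= 0:
        raise ValueError("at least one response must be positive")
    ref_n = ref / peak
    selectivity = [p / peak - ref_n for p in probes]   # probe minus reference
    interaction = [q / peak - ref_n for q in pairs]    # pair minus reference
    return selectivity, interaction
```

For example, with a reference response of 0.8, a probe response of 0.4 and a pair response of
0.6 (and a maximal response of 1.0 elsewhere), the selectivity index is -0.4 and the sensory
interaction index is -0.2: adding the weaker probe suppresses the response.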

As shown in Figure 2B, model C2 units and V4 cells respond very similarly to the presentation of
two stimuli within their receptive field. Indeed, the slope relating the sensory interaction index
to the selectivity index is about 0.5 for both model units and cortical cells. That is, at the population level,
presenting  a  preferred  and  a  non-preferred  stimulus  together  produces  a  neural  response  that 
falls  between  the  neural  responses  to  the  two  stimuli  individually,  sometimes  close  to  an 
average. We have found that such a "clutter effect" also happens higher up in the hierarchy at
the  level  of  IT,  see  (Serre  et  al.,  2005).  Since  normal  vision  operates  with  many  objects 
appearing  within  the  same  receptive  fields  and  embedded  in  complex  textures  (unlike  the 
artificial  experimental  setups),  understanding  the  behavior  of  neurons  under  clutter  conditions  is 
important  and  warrants  more  experiments  (see  later  section  3.2.4  and  section  4.2).  In  sum,  the 
model  can  capture  many  aspects  of  the  physiological  responses  of  neurons  along  the  ventral 
visual stream from V1 to IT cortex (see also (Serre et al., 2005)).
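
Why a slope of about 0.5 corresponds to averaging can be checked with a small worked example.
If the response to the pair is the average of the responses to the two stimuli alone, then
pair - reference = (probe - reference) / 2, so the sensory interaction index is half the
selectivity index. The sketch below uses hypothetical normalized responses (not measured data)
and an ordinary least-squares fit; the helper name ols_slope is an assumption.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


# Hypothetical normalized responses of one unit.
ref = 0.5
probes = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
selectivity = [p - ref for p in probes]            # probe minus reference

# Case 1: the pair response is the average of the two single responses.
pair_avg = [(ref + p) / 2.0 for p in probes]
slope_avg = ols_slope(selectivity, [r - ref for r in pair_avg])

# Case 2: the probe has no influence on the pair response.
pair_ref = [ref for _ in probes]
slope_ref = ols_slope(selectivity, [r - ref for r in pair_ref])

# Case 3: the pair response equals the response to the probe alone.
pair_probe = list(probes)
slope_probe = ols_slope(selectivity, [r - ref for r in pair_probe])
```

In this toy setting the three cases give slopes of 0.5, 0 and 1 respectively, so a slope near
0.5 is what one would expect if the pair response falls midway between the individual
responses.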
Figure 2: A quantitative comparison between model C2 units and V4 cells. A) Stimulus configuration
(modified from Figure 1A in (Reynolds et al., 1999)): The stimulus in position 1 is denoted as the
reference and the stimulus in position 2 as the probe. As in (Reynolds et al., 1999) we computed a
selectivity index (which indicates how selective a cell is to an isolated stimulus in position 1 vs. position 2
alone) and a sensory interaction index (which indicates how selective the cell is to the paired stimuli vs.
the reference stimulus alone), see text and (Serre et al., 2005) for details. B) Side-by-side comparison
between V4 neurons (left, adapted from Fig. 5 in (Reynolds et al., 1999)) while the monkey attends away
from the receptive field location and C2 units (right). Consistent with the physiology, the addition of a
second stimulus in the receptive field of the C2 unit moves the response of the unit toward that of the
second stimulus alone, i.e., the response to the clutter condition lies between the responses to the
individual stimuli.
(a) One V4 neuron

(b) One model C2 unit


Figure 3: A comparison between the response of (a) a single V4 neuron (corresponding to Fig. 4A in (Pasupathy and
Connor, 2001)) and (b) a single model C2 unit over the boundary conformation stimulus set. The gray level of
the stimulus background indicates the response magnitude to each stimulus (the darker the shading, the stronger
the response). The model unit was picked from the population of 109 model C2 units under study. Both units
exhibit a very similar pattern of responses (overall correlation r = 0.78). The fit between the model unit and the V4
neuron is quite remarkable given that no fitting procedure was used to learn the weights of the
model unit: the unit was simply selected from a small population of 109 model units that were learned from
natural images and selected at random. The inset at the lower right of the figure describes the corresponding
receptive field organization of the C2 unit. Each oriented ellipse characterizes one subfield at matching
orientation. Color encodes the strength of the connection between the subfield and the unit.
Decoding object information from IT and model units


We  recently  used  a  simple  linear  statistical  classifier  to  quantitatively  show  that  we  could 
accurately,  rapidly  and  robustly  decode  visual  information  about  objects  from  the  activity  of 
small  populations  of  neurons  in  anterior  inferior  temporal  cortex  (Hung  et  al.,  2005).  In 
collaboration  with  Chou  Hung  and  James  DiCarlo  at  MIT,  we  observed  that  a  binary  response 
from  the  neurons  (using  small  bins  of  12.5  ms  to  count  spikes)  was  sufficient  to  encode 
information  with  high  accuracy.  This  robust  visual  information,  as  measured  by  our  classifiers, 
could  in  principle  be  decoded  by  the  targets  of  IT  cortex  such  as  prefrontal  cortex  to  determine 
the  class  or  identity  of  an  object  (Miller,  2000).  Importantly,  the  population  response  generalized 
across  object  positions  and  scales.  This  scale  and  position  invariance  was  evident  even  for 
novel  objects  that  the  animal  never  observed  before  (see  also  (Logothetis  et  al.,  1995)).  The 
observation  that  scale  and  position  invariance  occurs  for  novel  objects  strongly  suggests  that 
these  two  forms  of  invariance  do  not  require  multiple  examples  of  each  specific  object.  This 
should  be  contrasted  with  other  forms  of  invariance,  such  as  robustness  to  depth  rotation,  which 
requires  multiple  views  in  order  to  be  able  to  generalize  (Poggio  and  Edelman,  1990). 
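
The binary response mentioned above can be sketched as follows: the analysis window is divided
into 12.5 ms bins and each bin is marked 1 if it contains at least one spike. This is a minimal
illustration only; the function name, the window boundaries and the spike times in the usage
note are hypothetical, and the exact windowing used in (Hung et al., 2005) is not reproduced
here.

```python
def binarize_spikes(spike_times_ms, start_ms, stop_ms, bin_ms=12.5):
    """Return one 0/1 value per bin: 1 if the bin contains at least one spike."""
    n_bins = int((stop_ms - start_ms) // bin_ms)
    bits = [0] * n_bins
    for t in spike_times_ms:
        i = int((t - start_ms) // bin_ms)   # index of the bin containing t
        if 0 <= i < n_bins:                 # ignore spikes outside the window
            bits[i] = 1
    return bits
```

For instance, spikes at 1.0, 3.0 and 30.0 ms in a 0-50 ms window fall in the first and third of
four bins; the two spikes in the first bin count only once.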

We  examined  the  responses  of  the  model  units  to  the  same  set  of  77  complex  object  images 
seen  by  the  monkey.  These  objects  were  divided  into  8  possible  categories.  The  model  unit 
responses  were  divided  into  a  training  set  and  a  test  set.  We  used  a  one-versus-all  approach, 
training  8  binary  classifiers,  one  for  each  category  against  the  rest  of  the  categories,  and  then 
taking  the  classifier  prediction  to  be  the  maximum  among  the  8  classifiers  (for  further  details, 
see  (Hung  et  al.,  2005;  Serre  et  al.,  2005)).  Similar  observations  were  made  when  trying  to 
identify  each  individual  object  by  training  77  binary  classifiers.  For  comparison,  we  also  tried 
decoding  object  category  from  a  random  selection  of  model  units  from  other  layers  of  the  model. 
The  input  to  the  classifier  consisted  of  the  responses  of  randomly  selected  model  units  and  the 
labels  of  the  object  categories  (or  object  identities  for  the  identification  task).  Data  from  multiple 
units  were  concatenated  assuming  independence. 
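
The one-versus-all scheme described above can be sketched in a few lines. To keep the example
self-contained, a simple mean-difference linear discriminant stands in for the classifier used
in the study; this substitution, the function names and the toy data in the usage note are
assumptions for illustration. One linear classifier is trained per category against the rest,
and the prediction is the category whose classifier gives the maximal output.

```python
def train_one_vs_all(features, labels):
    """Train one linear classifier (w, b) per category against all others."""
    dim = len(features[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    models = {}
    for c in sorted(set(labels)):
        pos = mean([x for x, y in zip(features, labels) if y == c])
        neg = mean([x for x, y in zip(features, labels) if y != c])
        # Weight vector points from the "rest" mean toward the category mean;
        # the bias places the decision boundary midway between the two means.
        w = [p - q for p, q in zip(pos, neg)]
        b = -sum(wj * (p + q) / 2.0 for wj, p, q in zip(w, pos, neg))
        models[c] = (w, b)
    return models


def predict(models, x):
    """Return the category whose classifier output is maximal."""
    def score(c):
        w, b = models[c]
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(models, key=score)
```

With 8 categories this trains 8 binary classifiers; the identification task would use the same
code with 77 labels, one per object. Each row of features would hold the concatenated responses
of the selected units to one stimulus presentation.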

We  observed  that  we  could  accurately  read  out  the  object  category  and  identity  from  model 
units. In Figure 4A, we compare the classification performance, for the categorization task
described  above,  between  the  IT  neurons  and  the  C2b  model  units.  In  agreement  with  the 
experimental  data  from  IT,  units  from  the  C2b  stage  of  the  model  yielded  a  high  level  of 
performance  (>  70%  for  100  units;  where  chance  was  12.5%).  We  observed  that  the 
physiological  observations  were  in  agreement  with  the  predictions  made  by  the  highest  layers  in 
the model (C2b, S4) but not by earlier stages (S1 through S2). As expected, the layers from S1
through  S2  showed  a  weaker  degree  of  scale  and  position  invariance. 

The  classification  performance  of  S2b  units  (the  input  to  C2b  units,  see  Figure  1)  was 
qualitatively  close  to  the  performance  of  local  field  potentials  (LFPs)  in  IT  cortex  (Kreiman  et  al., 
2006).  The  main  components  of  LFPs  are  dendritic  potentials  and  therefore  LFPs  are  generally 
considered  to  represent  the  dendritic  input  and  local  processing  within  a  cortical  area  (Mitzdorf, 
1985;  Logothetis  et  al.,  2001).  Thus,  it  is  tempting  to  speculate  that  the  S2b  responses  in  the 
model  capture  the  type  of  information  conveyed  by  LFPs  in  IT.  However,  care  should  be  taken 
in  this  interpretation  as  the  LFPs  constitute  an  aggregate  measure  of  the  activity  over  many 
different  types  of  neurons  and  large  areas.  Further  investigation  of  the  nature  of  the  LFPs  and 
their  relation  with  the  spiking  responses  could  help  unravel  the  transformations  that  take  place 
across  cortical  layers. 

The  pattern  of  errors  made  by  the  classifier  indicates  that  some  groups  were  easier  to 
discriminate  than  others.  This  was  also  evident  in  the  correlation  matrix  of  the  population 
responses between all pairs of pictures (Serre et al., 2005; Hung et al., 2005). The units yielded
similar  responses  to  stimuli  that  looked  alike  at  the  pixel  level.  The  performance  of  the  classifier 
for  categorization  dropped  significantly  upon  arbitrarily  defining  the  categories  as  random 
groups  of  pictures. 
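
The correlation analysis mentioned above can be illustrated with a short sketch. It computes the Pearson correlation between the population response vectors of every pair of stimuli and compares the average correlation within and between groups. The function names and the synthetic responses are assumptions, not the analysis code of the cited studies.

```python
import numpy as np


def stimulus_correlation_matrix(responses):
    """responses: (n_stimuli, n_units). Returns (n_stimuli, n_stimuli) correlations."""
    # np.corrcoef treats each row as one variable, i.e. one stimulus here.
    return np.corrcoef(responses)


def within_between_correlation(corr, labels):
    """Average correlation for same-group pairs versus different-group pairs."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    within = corr[same & off_diagonal].mean()
    between = corr[~same].mean()
    return float(within), float(between)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.repeat(np.arange(3), 4)            # 3 hypothetical groups
    prototypes = rng.normal(size=(3, 50))          # 50 hypothetical units
    responses = prototypes[labels] + 0.3 * rng.normal(size=(12, 50))
    corr = stimulus_correlation_matrix(responses)
    print(within_between_correlation(corr, labels))
```

Replacing `labels` with a random permutation of itself gives a simple version of the control in which categories are defined as arbitrary groups of pictures.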

We  also  tested  the  ability  of  the  model  to  generalize  to  novel  stimuli  not  included  in  the  training 
set.  The  performance  values  shown  in  Figure  3  are  based  on  the  responses  of  model  units  to 
single  stimulus  presentations  that  were  not  included  in  the  classifier  training  and  correspond  to 
the  results  obtained  using  a  linear  classifier.  Although  the  way  in  which  the  weights  were 
learned (using a support vector machine classifier) is probably very different in biology (see
Serre, 2006), once the weights are established the linear classification boundary could very
easily be implemented by neuronal hardware. Therefore, the recognition performance provides
a lower bound to what a real downstream unit (e.g., in PFC) could, in theory, perform on a
single  trial  given  input  consisting  of  a  few  spikes  from  the  neurons  in  IT  cortex. 

Overall,  we  observed  that  the  population  of  C2b  model  units  yields  a  read-out  performance  level 
that  is  very  similar  to  the  one  observed  from  a  population  of  IT  neurons. 

Figure 4: Classification performance based on the spiking activity from IT neurons (black) and C2b units
from  the  model  (gray).  The  performance  shown  here  is  based  on  the  categorization  task  where  the 
classifier  was  trained  based  on  the  category  of  the  object.  A  linear  classifier  was  trained  using  the 
responses to the 77 objects at a single scale and position (shown for one object by “TRAIN”). The
classifier  performance  was  evaluated  using  shifted  or  scaled  versions  of  the  same  77  objects  (shown  for 
one  object  by  “TEST”).  During  training,  the  classifier  was  never  presented  with  the  unit  responses  to  the 
shifted  or  scaled  objects.  The  left-most  column  shows  the  performance  for  training  and  testing  on 
separate  repetitions  of  the  objects  at  the  same  standard  position  and  scale  (this  is  shown  only  for  the  IT 
neurons  because  there  is  no  variability  in  the  model  which  is  deterministic).  The  second  bar  shows  the 
performance  after  training  on  the  standard  position  and  scale  (3.4  degrees,  center  of  gaze)  and  testing  on 
the  shifted  and  scaled  images.  The  dashed  horizontal  line  indicates  chance  performance  (12.5%,  1  out  of 
8  possible  categories).  Error  bars  show  standard  deviations  over  20  random  choices  of  the  units  used  for 
training/testing. 




3.  Comparison  between  the  model  and  other  state-of-the-art  computer  vision 
systems 

CalTech-101 


We compared the model to SIFT features (Lowe, 1999; Lowe, 2004) on the CalTech-101
database. As illustrated in Fig. 4, the model C2 features exhibit higher performance.


[Figure panels (a) and (b); recoverable axis label: SIFT-based features performance (equilibrium point)]

Figure 5: Comparison with SIFT features on the CalTech-101 database.


MIT  face  and  car  database 


We compared the performance of the C2 units to two computer vision systems that were
developed in the lab. The two benchmarks are also hierarchical (a first layer of SVM classifiers
detects object components and a second layer checks for their configuration). Model C2 units
outperform both systems.


Datasets                                Benchmark        Model

MIT-CBCL faces (Heisele et al., 2002)   90.4% correct    95.9% correct

MIT-CBCL cars (Leung, 2004)             75.4% correct    95.1% correct


StreetScene database


Rigid-objects: 

For comparison, we also implemented four other benchmark systems. Our simplest baseline
detector is a single-template Grayscale system: each image is normalized in size and histogram
equalized before the gray values are passed to a linear classifier (gentleBoost). Another
baseline detector, Local Patch Correlation, is built using patch-based features similar to [45].
Each feature f_i is associated with a particular image patch p_i, extracted randomly from the
training set. Each feature f_i is calculated in a test image as the maximum normalized cross
correlation of p_i within a subwindow of the image. This window of support is a rectangle
three times the size of p_i, centered in the image at the same relative location from which p_i
was originally extracted. The advantage of the patch-based features over the single-template
approach is that local patches can be highly selective while maintaining a degree of position
invariance. The system was implemented with N = 1,024 features and with patches of size 12 x
12 in images of size 128 x 128. The third benchmark system is a Part-based system as
described in (Leibe et al., 2004). Briefly, both object parts and a geometric model are learned via
image patch clustering. The detection stage is performed by redetecting these parts and
allowing them to vote for objects-at-poses in a generalized Hough transform framework. Finally,
we compare to an implementation of the Histogram of Gradients (HoG) feature of (Dalal &
Triggs, 2005), which has shown excellent performance on these types of objects. All benchmark
systems were trained and tested on the same data sets as the SMFs-based system. They all
use gentleBoost except (Leibe et al., 2004).
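
The Local Patch Correlation feature described above can be sketched as follows. This is an illustrative brute-force version under stated assumptions, not the benchmark implementation: images are taken to be already size-normalized 2-D arrays, so that "the same relative location" reduces to the same pixel coordinates, and the function and argument names are hypothetical.

```python
import numpy as np


def normalized_cross_correlation(a, b, eps=1e-8):
    """Zero-mean normalized correlation between two equally sized arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def patch_feature(image, patch, top, left):
    """Maximum NCC of `patch` inside its window of support.

    (top, left) is the corner from which the patch was originally extracted.
    The window is three times the patch size and centered on that location,
    so candidate corners range one patch size in every direction, clipped
    to the image borders.
    """
    ph, pw = patch.shape
    height, width = image.shape
    r0, r1 = max(top - ph, 0), min(top + ph, height - ph)
    c0, c1 = max(left - pw, 0), min(left + pw, width - pw)
    best = -1.0
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            candidate = image[r:r + ph, c:c + pw]
            best = max(best, normalized_cross_correlation(candidate, patch))
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_image = rng.random((128, 128))
    patch = train_image[40:52, 50:62].copy()       # a 12 x 12 patch
    test_image = np.roll(train_image, shift=(3, -4), axis=(0, 1))
    print(patch_feature(test_image, patch, top=40, left=50))
```

Because the maximum is taken over the whole window, the feature keeps its value when the matching structure shifts by a few pixels, which is the position tolerance the text attributes to patch-based features.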

The ROC results of this experiment are illustrated in Fig. 5. For the two (C1 and C2) SMFs-based
systems, the Grayscale as well as the Local Patch Correlation system, the classifier is
gentleBoost, but we found very similar results with both a linear and a polynomial-kernel SVM.
Overall, across the three object categories tested, the SMFs-based system performs best on cars and
bicycles and second behind HoG on pedestrians (the HoG system was parameter-tuned in
(Dalal & Triggs, 2005) to achieve maximal performance on this one class). Finally, for this
recognition task, i.e., with a windowing framework, the C1 SMFs seem to be superior to the C2
SMFs.
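
For readers unfamiliar with ROC analysis, the sketch below shows one way to obtain an ROC curve from detector scores, together with the operating point at which the false-positive rate equals the miss rate. It is a generic illustration with made-up scores. The function names are assumptions, and tied scores are not given special treatment.

```python
import numpy as np


def roc_curve(scores, labels):
    """Return (fpr, tpr) arrays obtained by sweeping the threshold downward.

    scores: detector outputs (higher means more object-like).
    labels: 1 for positive windows, 0 for negative windows.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]
    true_pos = np.cumsum(sorted_labels)
    false_pos = np.cumsum(1 - sorted_labels)
    tpr = np.concatenate([[0.0], true_pos / sorted_labels.sum()])
    fpr = np.concatenate([[0.0], false_pos / (1 - sorted_labels).sum()])
    return fpr, tpr


def equal_error_point(fpr, tpr):
    """Point on the curve where false-positive rate and miss rate are closest."""
    index = int(np.argmin(np.abs(fpr - (1.0 - tpr))))
    return float(fpr[index]), float(tpr[index])


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]   # hypothetical scores
    labels = [1, 1, 1, 0, 1, 0, 0, 0]
    fpr, tpr = roc_curve(scores, labels)
    print(equal_error_point(fpr, tpr))
```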

Textured-objects:

We implemented four benchmark texture classification systems. The Blobworld (BW) system
was constructed as described in (Carson et al., 1999). Briefly, the Blobworld feature, originally
designed  for  image  segmentation,  is  a  six-dimensional  vector  at  each  pixel  location;  three 
dimensions  encode  color  in  the  Lab  color  space  and  three  dimensions  encode  texture  using  the 
local  spectrum  of  gradient  responses.  We  did  not  include  the  color  information  for  a  fair 
comparison  between  all  the  various  texture  detection  methods. 

The systems labeled T1 and T2 are based on (Renninger & Malik, 2004). In these systems, the
test image is first processed with a number of predefined filters. T1 uses 36 oriented edge filters
arranged in five-degree increments from 0 degrees to 180 degrees. T2 follows (Renninger &
Malik, 2004) exactly by using 36 Gabor filters at six orientations, three scales, and two phases.
For both systems independently, a large number of random samples of the 36-dimensional
edge response images were taken and subsequently clustered using k-means to find 100
cluster centroids (i.e., the textons). The texton image was then calculated by finding the index of
the nearest texton to the filter response vector at each pixel in the response images. A
100-dimensional texton feature vector was then built by calculating the local 10 x 10 histogram of
nearest texton indexes. Finally, the Histogram of edges (HoE) system was built by simply using
the same type of histogram framework, but over the local 36-dimensional directional filter
responses (using the filters of T1) rather than the texton identity. Here, as well, learning was
done using the gentleBoost algorithm (again a linear SVM produced very similar results). The
within-class variability of the texture-objects in this test is considerably larger than that of the
texture classes usually used to test texture-detection systems, making this task somewhat
different. This may explain the relatively poor performance of some of these systems on certain
objects.
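
The texton pipeline used by T1 and T2 can be sketched in a few steps: cluster sampled filter-response vectors with k-means, label each pixel with its nearest texton, and histogram the labels in a local window. The code below is a minimal NumPy sketch under stated assumptions. The filter bank itself is omitted, random numbers stand in for the 36-dimensional responses, the k-means routine is a plain Lloyd iteration written for this example, and all names are hypothetical.

```python
import numpy as np


def nearest_centroid(data, centroids):
    """Index of the closest centroid (squared Euclidean distance) per row."""
    distances = ((data ** 2).sum(axis=1)[:, None]
                 - 2.0 * data @ centroids.T
                 + (centroids ** 2).sum(axis=1)[None, :])
    return np.argmin(distances, axis=1)


def kmeans(data, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns the (k, n_features) centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        assignment = nearest_centroid(data, centroids)
        for j in range(k):
            members = data[assignment == j]
            if len(members):               # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)
    return centroids


def texton_map(responses, textons):
    """responses: (H, W, n_filters) -> (H, W) map of nearest-texton indexes."""
    height, width, n_filters = responses.shape
    flat = responses.reshape(-1, n_filters)
    return nearest_centroid(flat, textons).reshape(height, width)


def local_texton_histogram(tmap, row, col, k, size=10):
    """Counts of each texton index in the size x size window at (row, col)."""
    window = tmap[row:row + size, col:col + size]
    return np.bincount(window.ravel(), minlength=k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    responses = rng.normal(size=(32, 32, 36))      # stand-in filter responses
    samples = responses.reshape(-1, 36)
    textons = kmeans(samples, k=100)               # 100 textons, as in the text
    tmap = texton_map(responses, textons)
    print(local_texton_histogram(tmap, 5, 5, k=100))
```

The HoE variant described in the same paragraph would skip the clustering step and accumulate the filter responses themselves over the local window.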

As  shown  in  Fig.  6,  the  SMFs-based  texture  system  seems  to  consistently  outperform  the 
benchmarks  (BW,  T1,  T2,  and  HoE).  C2  compared  to  C1  SMFs  may  be  better  suited  to  this  task 
because  of  their  increased  invariance  properties  and  complexity. 


[Figure panels: car detection ROC curve; pedestrian detection ROC curve. Horizontal axis: False Positive Rate]

Figure 6: Comparison between the model (C1 SMFs and C2 SMFs) and other state-of-the-art systems on
the MIT StreetScene database for the recognition of rigid objects.

[Figure panels: Pixelwise Tree Detection; Pixelwise Sky Detection]

Figure: Comparison between the model (C1 and C2) and other state-of-the-art systems on the MIT
StreetScene database for the recognition of textured objects.


4. Automatic recognition of actions in videos: a tool for behavioral phenotyping

During the last year, we developed a prototype system for the recognition of basic rodent
behaviors (see Fig. 1 for examples). The work was based on a model of the dorsal pathway in
the visual cortex, whose computer implementation has shown state-of-the-art
performance in the recognition of human actions.

With  the  McGovern  funding,  we  were  able  to  collect  about  100  hours  of  mouse  monitoring  to 
train  and  test  the  system.  We  video  recorded  singly  housed  mice  from  an  angle  perpendicular  to 
the  side  of  the  cage  (Figure  below).  In  order  to  train  a  robust  detection  system,  we  used  at  least 
six  different  camera  angles  in  our  training  (and  test)  set,  all  of  which  had  slightly  different  lighting 
conditions.  In  addition,  we  utilized  mice  of  different  size,  gender,  and  coat  color.  Several 
summer students manually annotated these videos. Table 1 compares the performance of our
system with that of human labelers and with the Clever Sys. commercial system. Our system
achieves near-human level performance on this task and is
significantly  better  than  an  existing  commercial  system.  A  demo  of  the  system  can  be  found  at 
http://techtv.mit.edu/videos/1838. 


Figure 1: Snapshots taken from representative videos for 8 types of behavior that the system was trained
to recognize.


              Our system    Clever Sys.          Inter-human
                            commercial system    agreement

Performance   71.0          56.0                 71.6


Table 1: Performance of the system (percent frames correctly classified) and comparison with an
available commercial system and with humans, as measured by the agreement on the labeling performed by
two independent labelers.
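
The measure reported in Table 1 is a per-frame agreement rate, which can be computed as in the sketch below. The same function serves for system-versus-annotation accuracy and for agreement between two labelers. The function name and the behavior labels in the example are placeholders, not the categories or data used in the study.

```python
import numpy as np


def percent_frames_agreeing(labels_a, labels_b):
    """Percentage of frames on which two label sequences coincide."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.shape != b.shape:
        raise ValueError("both sequences must label the same frames")
    return 100.0 * float(np.mean(a == b))


if __name__ == "__main__":
    system = ["rest", "walk", "walk", "eat"]       # placeholder labels
    human = ["rest", "walk", "eat", "eat"]
    print(percent_frames_agreeing(system, human))
```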


5. Hierarchical Kernel Machines


We  have  developed  a  mathematical  framework  to  analyze  hierarchical  kernel  machines 
motivated by the architecture of the primate visual cortex. The project has two main motivations:
1) primates seem to be able to learn complex tasks from far fewer examples than our
present non-hierarchical kernel-based learning algorithms predict, e.g., they are able to solve
the “poverty of stimulus” problem; 2) a preliminary computational model at MIT of the
feedforward flow of information in visual cortex performs well on difficult recognition tasks
compared to existing computer vision systems. What is needed now is an approach based on a
mathematical theory to explain why a hierarchy is needed, and under which conditions, and to
provide a framework for optimizing the learning architecture and its parameters. A theory, such
as  the  one  of  which  we  have  the  foundations,  will  explain  why  hierarchical  models  work  as  well 
as  they  do  and  what  the  computational  reasons  for  the  hierarchical  organization  of  cortex  are, 
leading  to  potentially  significant  contributions  to  outstanding  challenges  in  machine  learning  and 
computer  science.  The  development  of  new  powerful  learning  techniques  such  as  the 
hierarchical  kernel  machines  we  propose  can  be  of  pervasive  importance  for  many  capabilities 
of the DoD, because machine learning is becoming the common mathematical language across
different areas of computer science and because of the reliance of the DoD on computers and
algorithms. Our theory should help preprocess and interpret the huge flow of
electronic  information  --  such  as  images  --  provided  by  different  types  of  sensors.  Navigation, 
surveillance  and  intelligence  are  just  three  of  the  areas  that  could  be  hugely  impacted  by  the 
development  of  novel  learning  techniques,  inspired  by  the  brain  and  based  on  solid 
mathematical  foundations. 